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Abstract 

We explore the behavior of wind speed over time, using the Eastern Wind Dataset 
published by the National Renewable Energy Laboratory. This dataset gives wind 
speeds over three years at hundreds of potential wind farm sites. Wind speed anal¬ 
ysis is necessary to the integration of wind energy into the power grid; short-term 
variability in wind speed affects decisions about usage of other power sources, so 
that the shape of the wind speed curve becomes as important as the overall level. 

To assess differences in intra-day time series, we propose a functional distance mea¬ 
sure, the band distance, which extends the band depth of Lopez-Pintado and Romo 
(2009). This measure emphasizes the shape of time series or functional observations 
relative to other members of a dataset, and allows clustering of observations with¬ 
out reliance on pointwise Euclidean distance. To emphasize short-term variability, 
we examine the short-time Fourier transform of the nonstationary speed time series; 
we can also adjust for seasonal effects, and use these standardizations as input for 
the band distance. We show that these approaches to characterizing the data go 
beyond mean-dependent standard clustering methods, such as k-means, to provide 
more shape-influenced cluster representatives useful for power grid decisions. 

Keywords: Depth statistics; Distance metrics; Cluster analysis; Functional data; Time- 
frequency analysis; Wind power. 
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1 Introduction 


A key concern in power engineering is the characterization of the behavior of wind speed (or 
power output) over time, at a number of wind turbine locations, using only a small number 
of representatives. These representatives can then used to characterize wind speed behavior 
for training and testing power system algorithms that incorporate wind generation, such as 
the SuperOPF (Optimal Power Flow) system described by Murillo-Sanchez et ah (2013), 
which aims to make optimal decisions about activating generators and dispatching power. 
The number of representatives is generally constrained by the computational complexity 
of the algorithm using the data, so we treat this number as hxed in the discussion below. 

Wind speed data for thousands of locations are produced by NREL, and available 
through the EWITS database (see http://www.nrel.gov/electricity/transniission/ 
eastern_wind_dataset.html). Each site in the database has location information and 
wind speed measurements at ten-minute time increments for three years, allowing us to 
consider the data as either a high-frequency time series or functional data. In this analysis, 
we will look at wind speed in units of days, each day being a single time series from midnight 
to midnight (whose length is 144, since we use ten-minute time increments). We choose this 
time frame in the context of making day-ahead decisions about unit commitment (generator 
activation); but the methods described here can be applied to time series of different length, 
or different resolution on the time axis, as is desired in some applications. For example. 


Pourhabib et al. (2015) perform forecasting with a one-hour time resolution and a six-hour 
window, citing six hours as a typical cutoff for “short-term” wind forecasts, beyond which 
meterologically-based models may be preferred to data-driven approaches. We must also 


consider the nonstationarity of wind speed when viewed as a time series; Pourhabib et al. 
(2015) take a coarsely-grained approach, defining “epochs” such as “6pm to 12 am, all days 
in January,” while we will explore a more high-resolution time-frequency analysis. 

An effective set of representative days must cover the range of wind behaviors as far 
as possible, and include information about the relative probabilities of seeing each type of 
day. Clustering the data, then, is a natural way to capture different recurrent behaviors: 
the cluster centers can provide a reasonable representative of each behavior type, while 
the cluster sizes indicate which types are most likely. But a simple “high wind, low wind” 
breakdown, obtained by clustering on average wind level during the day, is largely uninfor¬ 
mative. For example, wind power often cannot be stored, and backup generators cannot be 
activated instantly and have ramping constraints on their output; so the shape of the wind 
speed curve has a critical effect on optimal decision-making. This shape can vary widely, 
even among days with similar average levels, as shown in Figure 
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Figure 1: Three sample days from the dataset. The mean level for each of these days is 
approximately 11 m/s, despite their different shapes. 


Methods for calculating the similarity or distance between two time series have been 
reviewed in several sources, for example Gunopulos and Das (2000). Many of these tech¬ 
niques, such as the edit distance or Dynamic Time Warping and its extensions, are designed 
to allow shifting or stretching behavior along the time axis. If we do not wish to allow time 
warping, perhaps because we are basing time-sensitive decisions on our results, we may 
approach the time series as high-dimensional data, with each time point as a dimension. 
The dimension of a discrete time series is then equal to its length: not as extreme as, for 
example, some text-matching datasets, but still high enough to require specialized method¬ 
ology. 

Much work has been done on the choice of distance functions in high-dimensional spaces. 
The intuitive choice, and a typical method, is to extend Euclidean spatial distance to higher 
dimensions using the Lp norm: for two n-dimensional observations x and y, Lp(x, y) = 

( Xir=i ■ For example, p = 2 leads to the the familiar RMSE (root mean squared 

error). This approach, however, has several drawbacks. First, it is sensitive to small 
differences in level between observations, emphasizing the mean behavior. For example, 
applying k-means clustering to the dataset yields remarkably uninformative groups, and 
has the additional disadvantage that the cluster centers, as pointwise means of all cluster 
members, are unrealistically smooth (see Figure]^. Indeed, Kusiak and Li (2010) generate 
clusters based on various parameters using k-means (in their case, in order to ht a different 
short-term predictive model of power generation for each cluster), and hnd that performing 
this clustering with wind speed does not lead to better models. 
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Figure 2: The representatives obtained by applying k-means with six clusters to the dataset. 
Mean behavior is dominant. 


The Lp norm is also dominated by observations’ behavior on dimensions, or in our case 
at times, where differences are large; if there is heteroskedasticity across times, those times 
with higher variation will tend to contribute the most to the distance calculation. [Beyer 


et ah (1999) describe the problem of loss of contrast as the number of dimensions grows 


(that is, the distance from an observation to its nearest neighbor becomes, relatively, about 
the same as the distance to its farthest neighbor), and Aggarwal et ah (2001) demonstrate 
that the Euclidean norm is particularly subject to this problem, noting that this sensitivity 
to heteroskedasticity may be to blame. In a similar vein, the Lp norm relies on absolute 
distance between elements of x and y, and may thus also be sensitive to skew in the 
distribution of observations; and, more broadly, it does not consider the observations in the 
context of the rest of the dataset. While we might attempt to adjust for heteroskedasticity 
and skew by assigning weights to the dimensions in the distance calculation, or applying 
transformations to the observations themselves, these approaches would require expert 
choices of functions and parameters. 


Aggarwal and Yu (2000) propose a method called IGrid-index based on a similarity 


score called PIDist, which manages the problem of heteroskedasticity across dimensions 
by calculating equidepth ranges on each dimension; x and y are assigned a similarity 
based on (i) the number of dimensions on which they fall into the same range, and (ii) 
their absolute proximity on each of these dimensions, as a proportion of the range. The 
similarities, and distinctions, between the PIDist function and the metric proposed here 
will be discussed further in section 2. 

To emphasize differences in shape, or to define the similarity of pairs of observations in 
a way that takes the rest of the dataset into account, a different approach is needed. Note 
that these methods could be of use in other applications where the shape of time series or 
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functional data is a concern, such as tracking meteorological data, growth curves, or the 
number of users of a system over time. Feng and Ryan (2013) explore selecting scenarios 


for the time series of future energy demand and fuel cost, in the context of generation 
expansion planning (GEP); while they use a multistage selection method instead of single- 
stage clustering, their method still relies on Ending the pairwise distances between scenarios. 

We turn to the methods of depth statistics, which are designed to express the centrality 
of observations relative to a dataset, withont reference to external scales or parameters. In 
the one-dimensional case, the depth statistic corresponds to the median and qnantiles of a 
dataset, with the median being the deepest, or most central, point. This concept can be 
extended to difierent types of data. For example, Liu (1990) gives a version, the simplicial 
depth, for the centrality of points in 3?^ relative to a distribntion in 3?^. This depth measures 
how often the observation of interest falls inside a simplex determined by p-|-1 points drawn 
from the distribution; those points which fall inside such random simplices more often are 
considered more central. 

Another extension is the band depth, developed by Lopez-Pintado and Romo (2009) to 
judge typical or atypical shapes of functional observations. Instead of simplices, Lopez- 
Pintado and Romo use bands defined by two or more observations drawn from a fixed 
dataset. At each point in the domain, the upper limit of the band is the maximnm valne 
of all the observations defining the band, while the lower limit is the minimum value of all 
these observations. In the original band depth, each observation is compared to each band; 
if the observation falls within the band’s limits at each point in the domain it is considered 
to lie within the band. Those observations falling within the most bands are considered 
the most central. 

With many datasets, however, observations successfnlly lie within very few bands, caus¬ 
ing many ties in depth between mnltiple observations. Lopez-Pintado and Romo thus in¬ 
troduce the generalized band depth. Here, instead of receiving a binary “in or out” score 
for each band, an observation receives a score for each band corresponding to the propor¬ 
tion of the domain for which the observation falls within the band’s limits. This version 
greatly rednces ties, and it allows us to use only two observations to define each band (while 
in the original version, bands defined using only two observations often failed to contain 
any other observations, reqniring the nse of more forgiving, bnt less informative and more 
compntationally expensive, three-observation bands). 

The band depth can easily be adapted for use in classification of data; for example, 
given two preexisting groups and a new observation to be classified, we could simply place 
the observation in the group where it would be more central. In itself, however, it cannot 
be used for unsupervised clustering. Since we wish to generate representatives and clusters 
from our dataset without prior knowledge, we extend the concept of the generalized band 
depth to a distance measure that yields pairwise distances between observations, allowing 
the use of any clustering method based on snch pairwise distances. 

Onr methodology is described in Section In Section we apply the method to 
simnlated data, and in Section]^ we examine the application to wind speed data, including 
transformations of the data obtained by removing typical seasonal behavior and using 
a time-freqnency representation, and obtain clustering results in each case. Section 
describes our conclusions and fnture research. Finally, the Appendix gives a proof that our 
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method defines a valid distance metric. 


2 Methodology 


Following the generalized band depth developed in Lopez-Pintado and Romo (2009), a 
band b = (bi, by) is defined by a subset of the observations; in our work, we use bands 
defined by two observations, and denote the set of all such bands by B. In the analysis 
below we assume a domain of discrete time points, though the method may be extended 
to more general domains by using an appropriate measure. (Of course, the domain need 
not be time at all, nor even have an ordering on the dimensions.) Thus, at each time point 
t, the upper and lower edges of the band are defined by the region bounded by the two 
observations. Hence a band b defined by the observations v and w has 


bi(t\v,w) = mm{v(t),w(t))-, bu(t\v,w) = max(u(t), 

An observation x lies within the band b (or b contains x) at index t if x{t) is in [be{t),bu{t)]. 
We use I^[x(t)] to denote an indicator function 


I^[xit)] 


1 if bi{t) < x{t) and x{t) < by{t) 
0 otherwise. 


For any band and observation, there is a set of time points, possibly empty, at which the 
observation falls within the band, which we denote T^{x) = {t : I^[x{t)] = 1}. The sizes 
of such sets can be used to obtain a measure of centrality for the observation x. We go 
further, however, to define a similarity score for any given band by adapting the Jaccard 
similarity. For two observations x and y and a band b, the bandwise similarity is defined 
as 

H |r^x)nr^i/)| 

Following the original Jaccard similarity, we define this quantity to be 1 when the de¬ 
nominator is zero (that is, for bands which never contain either x or y). Again follow¬ 
ing the Jaccard similarity metric, the bandwise distance is defined by subtracting the 
bandwise similarity from 1, so that d^y = 1 — s]^y. We then define the overall simi¬ 
larity between x and y to be the average of all bandwise similarities for “informative” 
bands. Specifically, let B^y be the set of all bands that contain either x oi y any in¬ 
dex, Bxy = {b:E,/" [a:(t)] -|- > 0}- W^e obtain the average of the bandwise 

similarity scores only over bands in this set, and call this the overall similarity, 

^ '\B~\ ^ 
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Likewise, the band distance between x and y, is defined as 


D 


xy 





which is equivalent to D^y = 1 — Sxy 

We also note that D^y has several desirable characteristics as a measure of distance. 

Theorem 1. D^y is a distance metric: that is, for any x, y, and z, we have D^y > 0 
(non-negativity); D^y = 0 i/ and only if x = y (identity of indiscernihles); D^y = Dy^ 
(symmetry); and D^z < D^y + Dyz (triangle ineguality). 

A complete proof is found in the Appendix. As is often the case, non-negativity and 
symmetry are trivial, and identity of indiscernihles is simple to show. The proof that D^^y 
fulfills the triangle inequality is considerably more involved. Given a band b, the bandwise 
distance dffy fulfills the triangle inequality because it is a Jaccard distance; an average of 
such distances would likewise be a distance metric. But to obtain D^y we take the average 
of these scores only over the set B^y of bands that contain either a; or y at some time, and 
the sets B^y, Byz, and B,j.z are not necessarily the same, nor the same size. 

We could consider these d^y sums as being over all possible bands, rather than only 
those in since the contribution to the sum XlbeB ^xy is zero for bands not in but 
note that we divide only by the size of B^y, not |B|, the size of the entire set of bands. 
While dividing by either quantity serves to produce distances that range conveniently 
between 0 and 1, there are several reasons for using |i?xy|- Primarily, this approach means 
that bands which never contain either observation have no effect on the distance between 
the observations (since these bands contribute to neither \Bxy\ nor can 

be considered as providing no information about the observations’ similarity. Were we to 
divide by the total number of bands |B|, each band that contained neither x nor y would 
serve to decrease D^y, since d^y for that band would be 0 while the band still contributed 
to |B|. Thus the introduction of new bands into the dataset would affect D^y even if those 
bands never contained either x or y. (Note that, with the present method of defining 
bands, the introduction of any new observation into the dataset still affects D^y through, 
at a minimum, those bands defined by the new observation and x or y. So the entirety of 
the dataset is taken into account in any event.) 

Furthermore, since d^y on a band that does contain a; or y is not zero unless x and y are 
in the band at exactly the same time points, a band containing neither observation would 
add less to the overall distance than a band containing x and y at nearly the same times. 
Yet the latter situation clearly gives more evidence for the similarity of x and y than the 
former. 

The resulting band distance has several useful properties. It is entirely data-driven, 
describing the distance between x and y in the context of the other observations in the 
dataset. It is invariant under any order-preserving operation on the observations in the 
dataset; we may apply monotonic transformations to all observations, including monotonic 
transformations which are not consistent across time. And, as we shall see in the next 
section, it treats all time points equally: those times where variation between observations 
is (in absolute terms) large do not dominate the overall calculation of distances. 
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It is interesting to note the distinctions between the band distance and the PIDist 
method of calculating high-dimensional similarity proposed by Aggarwal and Yu (2000). 
The PIDist similarity is, like the band distance, explicitly designed to take the overall 
dataset into account when looking at the similarity of observations. Like the Lp norm, 
it is a family of similarity functions indexed by p, and is calculated as follows. On each 
dimension t, divide the values of all the observations in the dataset into k equidepth groups 
(that is, groups which contain equal numbers of points). The highest and lowest observation 
values in each group generate the ranges. For two observations of interest x and y, there 
is some set S{x,y, k) of dimensions on which x{t) and y{t) are in the same range. For each 
such dimension, let be the width of the range into which x{t) and y{t) fall. Then the 
overall similarity of x and y is 


PIDist{x, y\k,p) 


E 

t&S{x,y,k) 



n 


i/p 


p 


Note that PIDist defines a similarity, and must be converted to a distance and, if 
desired, rescaled to give values between 0 and 1. It is also not shown to be a distance 
metric. Q 

The PIDist method is distinct from the band distance since the observations that dehne 
bands are not necessarily those that dehne ranges; but there are also conceptual differences. 
While the use of equidepth ranges helps deal with differing levels of variation across dimen¬ 
sions, on each dimension PIDist still uses the absolute distance between the observations, 
|x(t) — y{t)\. As a result, the method is not driven solely by the relative values of the 
observations in the dataset; it may remain sensitive to skew, and it is not invariant under 
order-preserving transformations as is the band distance. In addition, PIDist examines 
each dimension separately, whereas the band distance looks at x and y relative to the be¬ 
havior of other observations across all dimensions. Finally, the number of ranges k must 
be chosen by the user; Aggarwal notes that the optimal value appears to vary with the 
dimension of the data, and provides some suggestions for its selection, but it remains an 
externally set parameter. 


3 Simulation study 


To assess the performance of our metric, we use simulated time series data. To compare the 
performance of the different metrics, we generate a reference set containing observations of 
known types; then for each metric, we create a pairwise distance matrix, perform clustering 
with a hxed number of clusters, and calculate the Rand index (introduced in Rand (1971)) 
between the resulting classihcation and the true classihcation. The difference between the 
Rand index values for the band distance and for Euclidean distance, indicates which 
method gave a more accurate clustering, with positive values counting as a win for the 


Rand (1971 


^Many distance functions do not fulfill the necessary conditions to be a metric, but there is utility as 
well as theoretical appeal in doing so: for example, Gunopulos and Das (2000) note that many retrieval 
algorithms for finding data points similar to a given observation require the triangle inequality. 
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band distance. By repeating this process over M multiple runs, we can obtain mean and 

standard deviation for Aji, as well as a standardized z-score, Zn = AR/^{V^r{An)/M). 

Given that we can use any clustering method that relies on pairwise distances between 
observations, for the current analysis we use k-medoids. This method is similar to k-means, 
in that it is an iterative procedure; at each iteration, observations are assigned to clusters 
based on which cluster center is closest, then cluster centers are updated based on the new 
cluster memberships. In k-medoids, the cluster centers are always observations from the 
original dataset, specihcally the most central observation in each cluster (by contrast, in 
k-means, the cluster centers are the pointwise means of the cluster members). K-medoids 
accepts a pairwise distance matrix as input, and it yields sensible representatives for each 
cluster rather than the overly smoothed aggregate output of k-means. The R package 
“cluster” (see Maechler et ah (2013)) provides a tool for performing k-medoids clustering. 


which automatically selects an initial set of medoids based on the dataset. 


Simulation A. Since we are interested in examples of data with nonconstant variance, 
we simulate a set of observations from a multivariate normal distribution displaying het- 
eroskedasticity. In a simple case, we have two classes of observations, each of length 15, 
with the mean vectors 

[^i(t)] = (0.1,0.2, 0.3,..., 1.4,1.5)', [M2(t)] = (1.5,1.4,1.3,... 0.2, 0.1)'. 

We designate a nonconstant variance vector over time, constant over both classes for 
simplicity, 

[<T2(i)l = (l,l,l,l,3.5.7.9.3.2,l,l,l,l.l)' 

and a coefficient p = 0.9 to represent autocorrelation. Then, letting p = (p°, ..., p^^)', 

for both classes the covariance matrix is 

S = cr' * Toep{p) * cr, 

where Toep{p) is the Toeplitz matrix formed using the vector p. The mean and variance 
functions are shown in Figure 
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We simulate 10 observations of each type, and compare the results of clustering these 
curves with the known classihcation based on the n vector used to generate each. As 
described, we generate two pairwise distance matrices, one using Euclidean distance and one 
using band distance; then, in each case, we cluster using k-medoids with a hxed number of 
clusters. We perform the simulation M = 1000 times. The average value of is positive, 
and the standardized value is 9.58, indicating that the band distance signihcantly 
outperforms Euclidean distance. 

Simulation B. For a slightly more complicated test, we consider observations generated 
from nine different heteroskedastic multivariate normal distributions. The structure is as 
described above, with p = 0.9, but we have three different p, mean functions and three 
different cr^ variance functions, shown in Figure]^ for nine possible combinations. 




Figure 4: Simulation B: mean and variance vectors. Each combination of pu and cr^ dehnes 
a class. 

As before, we generate 10 observations of each type and run M = 1000 simulations; the 
band distance is again superior, with an even higher value of 23.9. 

Simulation C. To see how this behavior plays out in less artihcial data, we simulate time 
series from six different stationary ARMA models, and examine their periodograms. The 
six models are as follows: MA(1), 6 = 0.5; MA(2), 6 = (0.9, 0.9); MA(3), 6 = (0.8, 0.6, .2); 
AR(1), (p = 0.8; AR(2), (p = (0.3, 0.3); AR(2), cp = (0.9, —0.8), all with error variance equal 
to 1.0. We generate 15 realizations from each model, for a total sample of 90 simulated 
time series, each with 144 time points for comparability to the wind data. We then obtain 
the smoothed periodogram of each realization, using a modihed Daniell kernel; we need not 
take the short-time Fourier transform here since the underlying time series are stationary. 
The spectral density of each model, and a sample of these periodograms, are shown in 
Figure The clustering results from one simulation run are shown in Figures and 
respectively. 
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Figure 5: Simulation C: spectral densities of each model (left); sample of smoothed peri- 
odograms of simulated time series, showing three observations of each type (right). 
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Figure 6: Results of k-medoids clustering using Euclidean distance. Numbers of observa¬ 
tions of each type are given below each cluster. 
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Figure 7: Results of k-medoids clustering using band distance. Numbers of observations of 
each type are given below each cluster. 

Here, Euclidean distance commits several errors, subdividing one class and combining 
others, with a Rand index of 0.838 when compared to ground truth. Clustering based on 
the band distance, however, gives substantially more correct assignments, misclassifying 
only two observations (for a Rand index of 0.986). With M = 1000 runs, we obtain a 
value of 101.04, a convincing argument that the superior performance of the band distance 
shown above is not anomalous. 

We can ascribe the errors made with Euclidean distance in large part to the het- 
eroskedasticity of the data. The periodograms do not have equal variance at each frequency 
(the variance increases with the amplitude). The resulting large discrepancies between ob¬ 
servations, even observations of the same class, at these frequencies dominate the Euclidean 
calculation of distance. In the band distance, which relies only on the ordering of obser¬ 
vations at each point rather than their difference on an absolute scale, these frequencies 
receive no more “weight” than those at which overall variance is lower. 

4 Application: wind data 

The proposed distance metric can be applied to any high-dimensional real-valued observa¬ 
tions. We can, of course, apply it directly to the daily wind speed curves; but there are 
also certain transformations that may make the data more informative. 

4.1 Seasonal behavior 

Although wind behavior is highly variable from day to day, there are seasonal trends. For 
example. Figure shows that average wind speeds each day are higher in winter. There 
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is also a typical (though loosely followed) shape to the intra-day wind speed curve, which 
changes over the course of the year. By identifying this shape and removing it from our 
observations, we can emphasize departures from typical behavior. To this end, we £t a 
generalized additive model of the form W = s{t,c), where W represents wind speed, and 
s{t, c) is some smooth function of time of day, t, and calendar day of year, c. The model 
requires that the expected wind speed change smoothly over the course of the day, and 
also that the daily curve change smoothly over the course of the year. For the purposes 
of htting this model we have three replicates of each time-day combination, corresponding 
to the three years of observations. We £t the model using the “gam” function from the 
R package “mgcv” (see Wood (2014)), which uses Newton’s method. Figure shows the 


result. The typical day has higher wind speed at night, with a low point in the afternoon; 
this pattern is more pronounced in the summer, though overall levels are higher in winter. 



Figure 8: Left, mean wind speed by day, ow 
wind speed curves for each day of the year. 



12pm dpm 


3 years (2004-2006). Right, typical intra-day 


We can then remove this typical daily curve from our observations. Figure shows 
these typical daily curves as compared to the original wind speed observations for several 
days in June and December. We also see a comparison of original observations with the 
corresponding residual curves formed by removing the typical daily curve. The predominant 
effect is to shift the center of the observations; thus, removing the typical daily curve is 
valuable where we are concerned more with individual days’ shape than with the slow 
seasonal change in level. 
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Figure 9: Top, observations from June (solid lines) and December (dotted lines), with the 
typical daily curves shown as dashed lines. Bottom, the hrst 15 days of June (solid) and 
December (dotted) in year 1. Left, orignal observations; right, typical daily curves removed. 


4.2 Time-frequency analysis 


We may also wish to emphasize short-term variability in wind speeds. In our application, 
sharp drops and peaks in wind speed are of particular importance: for example, a day 
when wind speed increases slowly requires different decisions in the power system than a 
day when wind speed increases erratically, with sharp climbs and drops. To this end, we 
can look at wind behavior in the frequency domain, by taking the Fourier transform of the 
daily time series and examining days’ loadings on different frequencies. In this way, days 
with similar short-term erratic behavior will be marked as similar; smoothly varying days, 
with high power only on low frequencies, will likewise be similar to one another. Crucially, 
however, the wind speed series are not stationary over the course of the day; thus we use 
a time-frequency approach. By examining the short-time Fourier transforms of the wind 
speed series, we can see how frequency behavior changes across overlapping time windows. 
We use the R function “stft” from package “el071” (see Meyer et ah (2014)). In Figure [l0| 


we see two example STFTs on the log scale, one from a day with erratic changes in wind 
speed and one from a day with smooth changes. In this case we have applied the STFT to 
data with the typical daily wind curve removed. 
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Figure 10: Log-scale STFTs, above original time series from which they are calculated. 
Time windows proceed from left to right, frequencies from bottom (lowest frequency) to 
top (highest frequency). Loadings proceed from dark (low loading) to light gray (high 
loading). 


Note that the more erratic day, on the left, has signihcant scores on nearly all frequen¬ 
cies, particularly in the evening. In contrast, the day with smooth wind changes has almost 
no loading on the higher frequencies. 

By vectorizing these images, we obtain high-dimensional data to which we can apply 
our distance measure. Note that this transformation loses some information about the 
original series, since the limited number of rows in the STFT means we truncate the set 
of frequencies for which we have loadings. In addition, the overall level of the series is 
lost; here, our inputs were detrended series, but depending on the application, it may be 
advisable to include the mean as additional dimensions of the data. We may also choose 
to normalize by the overall variance within each day (perhaps including this value, too, as 
another element), so that the STFTs all have the same mean value. Then we are left with 
vectors that express only local variation. 


4.3 Clustering results 

We examine a sample of 90 observations, corresponding to the hrst 15 days of June and 
December in each year of the dataset. As above, we use k-medoids clustering and specify 
a hxed number of clusters; here we use six, corresponding to a common desired number of 
representative days in our application. First, we apply the distance metric and clustering 
directly to the wind speed time series, with results shown in Figure [TTj 
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Figure 11: Clustering results for original wind speed data, with number of June (solid) and 
December (dotted) observations in each cluster. Cluster centers (medoids) shown in bold. 

We can see that in many cases observations from the same month are clustered together, 
reflecting the different average levels in winter and summer. Using the corresponding series 
with the GAM-based intra-day trend removed yields notably different results (the Adjusted 
Rand Index between the two classihcations of the observations is only 0.126). In the latter 
case, shown in Figure clusters tend to contain a more even mix of summer and winter 
days. 
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Figure 12: Clustering results for observations with typical daily shape removed, with num¬ 
ber of June (solid) and December (dotted) observations in each cluster. Cluster centers 
(medoids) shown in bold. 


Finally, we can obtain the STFTs of these detrended days, to put maximum emphasis 
on short-term variation. Even with a low resolution, using only 14 overlapping time win¬ 
dows and 12 Fourier coefficients, the representative STFTs for each cluster show distinctly 
different behaviors (see Figure 13). For example, cluster 5 incorporates days with high- 
frequency behavior in the mid-afternoon, indicating sharply varying speeds at that time of 
day; in cluster 6, meanwhile, we see a tendency toward middle-frequency behavior at the 
beginning of the day. 
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Figure 13: Medoids obtained by clustering on STFTs of observations with typical daily 
shape removed. Time windows proceed from left to right, frequencies from bottom (lowest 
frequency) to top (highest frequency). Loadings are on the log scale, from dark (low 
loading) to light gray (high loading). 


By looking at the original data for the days falling into each cluster, shown in Figure 
Til we can see how these STFTs correspond to the original wind speed curves. While the 


observations are noisy, we can still observe the time-frequency behaviors discussed above. 
Note that overall level information is lost by using the STFTs; the emphasis here is on 
the changes in variability over time, as might be desired for making decisions about what 
power sources must be available to back up contributions from wind. 
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Figure 14: Clustering results for STFTs of observations with typical daily shape removed, 
with number of June (solid) and December (dotted) observations in each cluster. Cluster 
centers (medoids) shown in bold. 


5 Conclusion 


In this paper, we have introduced a distance metric based on the idea of the depth statistic 


(in particular, the generalized band depth of Lopez-Pintado and Romo (2009)), combined 
with the Jaccard distance measure. Our metric provides pairwise distances between high¬ 
dimensional observations, avoiding reliance on pointwise Euclidean distance and taking into 
account the behavior of the entire dataset of interest. Using the resulting pairwise distance 
matrix, we are able to perform clustering using k-medoids. 

We applied our methods to wind speed time series, drawn from the EWITS dataset, 
to obtain clusters of days demonstrating different wind behaviors as well as representative 
days of each type. To examine departures from typical behavior, we removed a smoothly 
varying typical daily curve from the observations; additionally, we used the short-term 
Fourier transforms of each day to obtain a time-frequency representation that emphasized 
local variation in wind speed. 

In the future, we will examine additional ways of extracting information from the raw 
wind speed curves, in particular ways of reducing their dimension for more effective clus¬ 
tering. For example, we can use use a constrained factor model for the time-frequency 
representations of the data, in which each day’s STFT is modeled using empirical orthogo¬ 
nal functions (constrained to be consistent across days) with different loadings. We can also 
seek covariates or recognizable meteorological characteristics associated with each cluster 
to increase interpretability. 

Finally, although we have focused on wind speed at a single site in this analysis, it 
is important to remember that a spatial component exists and is non-trivial. There is 
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strong association between wind speeds at the same site over short timeframes, and weaker 
correlation between wind speeds at different sites at approximately the same time, with a 
possible lag dependence in the prevailing wind direction. This inter-site correlation suggests 
possibilities for handling clustering across multiple locations in a more sophisticated manner 
than simply concatenating individual sites’ time series; but the nature of the correlation 
changes depending on the particular sites and on the time of day, and future work must 
take this variation into account. 

6 Appendix 

Consider a set of observations X consisting of at least three observations, and select three 
arbitrary representatives x, y, and from this set. We prove that D is a distance metric. 

Notation 

t time index 

X set of observations 

x, y, observations 

x{t) observation at time t 

N number of observations, |X| 

b band 

B set of all bands 

|B| number of bands, ( 2 ) 

huit) upper limit of band b at time t 

hi{t) lower limit of band b at time t 

I^[x{t)] indicator that observation x is in band b at time t\ 

I^[x{t)] if x{t) < bu{t) and x{t) > be{t), else I^[x{t)] = 0 
T^{x) set of times when observation x is in band b; 

T^{x) = {t : I^[x{t)] = 1} 

Bxy set of bands into which either x or y passes at any time: 

Bxy = {b ; > 0} 

s^y Jaccard similarity between x and y for band b; 

else s^y = 1 

d^y Jaccard distance between x and y for band b; 

d}^ =1 — 

'-''xy ^ ^xy 

Sxy overall similarity score for x and y. 

Q — 1 V 

\B^y\ ^xy 

(average Jaccard similarity between x and y 
over all bands containing a; or y at any time) 

Dy.y overall distance between x and y. 

n — 1 — ^ 

■‘-^xy ^xy 

or Dy,y = pW 
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Non-negativity We must show D^y > 0. To this end, note that for any b, \T^{x) fl 
T^iy)\ < |T'^(a;) U T^{y)\; thus, s'^y = is necessarily less than or equal to 1, 

and so = 1 — s'^y must be greater than or equal to zero. Since D^y is an average of 
values, it too is non-negative. 

Identity of indiscernibles We must show that D^y = 0 if and only ii x = y. 

First, snppose x = y. Then for all b, T^{x) = T^{y), and so \T^{x) nT^{y)\ = 
\T^{x) U T^{y)\. Then for all b (regardless of whether or not b ever contains x or y), 
s^y = 1 and d^y = 0, and thus D^y is also zero. 

To show the reverse implication, suppose that x ^ y. Then without loss of generality 
we can say that there is some time point t for which x{t) > y{t). 

Suppose that there exists an observation z E X such that z(t) < x(t), and consider the 
band b* dehned by 2 : and y. At t, y is in b*; but x{t) is strictly greater than both y{t) and 
z{t), so X is not in b* at t. Then \T^* (x) r]T^*{y)\ < \T^* (x) UT'’*(|/)|, so < 1 and 
d^y > 0. Since there is now a positive contribution to D^y, we know that D^y > 0. 

Now suppose there exists an observation z E X snch that z(t) > x(t), and consider the 
band b* dehned by 2 : and x. At t, x is in b* but y is not (since y{t) is less than both z{t) 
and x{t)). By similar reasoning to the above, we see that D^y > 0. Thus D^y is 0 only if 
x = y. 


Symmetry We mnst show that D^y = Dy^. This is evident from inspection of the 


dehnition; 


\T'^(x)nT^iy)\ 

\Tb{x)UTb{y)\ 


\T^{y)nT^{x)\ 

|Tb(y)UTb(x)|’ 


and B^y 


= B 


yx- 


Triangle inequality We must show that D^z < D^y + Dyz- Note that since x, y, and 
2 were selected arbitrarily, it suffices to demonstrate the ineqnality for this conhguration 
of points. 

Let Bxyz be the set of all bands that contain x or y or z at any time, and let O = \Bxyz\- 
Note that we need only be concerned with bands in B^^yz] if a band never contains either x 
or y, for example, it does not affect D^y, and so bands that are not in B^yz have no effect 
on either side of the ineqnality we are trying to show. 

We also need to denote certain subsets of Bxyz- Specihcally, let 

• B^z be the set of all bands that contain a; or 2 at any time, and 

• B_z;z be the set B^yzXBxz-, that is, the set of bands that contain y at some time but 
do not contain a; or 2 at any time. 

The sets Bxy, Byz, and so on are dehned accordingly. We use Oxy to represent the cardinality 
of sets of the type B^-y. For convenience, let p be the cardinality of Bz;z, and q be the 
cardinality of B_xz- 

Note that Bxz and B^xz partition the set Bxyz, so any relevant band will fall into exactly 
one of these subsets. Also note that O = p + q. 
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Recall the definitions. The bandwise similarity of x and y, s\.y, is: 


s 


b 

xy 


|TbnTb| 

|rbUTb| 

1 


if IT^UT^I^O 
if |TbuTb| = 0 


The bandwise distance of x and y, d^y, is simply 1 — s^y (so if neither x nor y is ever 
in the band b, then d^y = 0). Note that, considered for a particular b, d^y is the Jaccard 
distance and is itself a distance metric. 

Finally, the overall distance of x and y, D^y, is the average of the bandwise distance 
scores over all valid bands (that is, bands that contain x oi y some time): 


D 


xy 




b 

xy 


yy^Bxy 


oy E d 


b 

xy 


^^Bxyz 


since d^y = 0 for any b ^ B^y. 

Initially, we consider the case where g = 0: that is, any band that contains observation 
y at any time must also contain x or z at some time. Because d^^ < d^y + dy^ for any given 
b, we have 

E ^ E < + E <• 


Now, both Oxy and Oyz are necessarily less than or equal to O, and so the quantities 
and are both greater than or equal to one. Then we can multiply terms on the 
right-hand side of the inequality by these factors and preserve the inequality: 


E E E “i: 

hdBxyz hdBxyz bsS, 


b 

yz- 


Then we have: 


' E E <+?(- E 


O 


h&Bx 


Oy 




y^ 

^'^-Dxyz 

but since g = 0, we know that O = Oxz, so the inequality becomes 

' EE <+7^ E 


n — o ' n 

'~^xz . _ „ '~^xy . „ '~^yz 


which by definition tells us that Dxz < Dxy + Dyz, as desired. 

Now, suppose that g 7 ^ 0. We can list the bands in Bxyz as follows: 


Bxyz 


Bx: 

B_, 
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We then consider an expanded set of (non-unique) bands, in which we include p copies 
of Bxz followed by p copies of B^xz'- 


\ B 


XZ 


B 


Bxz 

B—xz 


B 


— XZJ 


And dually a set of extended bands, 
band in B: 


13 : = 


formed by appending a band from Bxz to each 


Bxz Bxz 

Bxz Bxz 
B—xz Bxz 

Bxz 

B—xz Bxz 


The hrst section of B consists of p copies of Bxz, each reduplicated to form a band of twice 
the original length. In the second section, the hrst half of each band is some band drawn 
from B-xz, and the second half of each band is drawn from Bxz- There are a total of pq 
bands in this section; each band in B_xz appears (as the hrst half of a band) p times, and 
each band in Bxz appears (as the second half of a band) q times. 

Now we examine the behavior of the bandwise distance on these sets. First, note 
that = Xlbes ^'xz since each band in Bxyz appears p times in B. 

Now, consider some band b”*" G .B. Suppose it is from the hrst section; that is, it has 
the form [b b], where b is a band in Bxz- We have: 


d 


b+ 

XZ 


1 - 


1 - 


1 - 


1 ^. 


'b+ 


n T 

X ' z 


b+| 


irr urpi 

2\T^UT^\ 


IJjnr, 


Kur.»| 


as b^ contains two disjoint repetitions of b 


Next suppose the band b+ has the form b 


b where b is drawn from Bxz and b is 
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drawn from B_xz- Then we have: 


d 


b+ 

xz 


|rr nrn 
|rr urri 

|r,»ur.b| 


d 


b 

xz"! 


as b“ contributes nothing to either or since neither x nor ever appears in b~. 

Each b G appears (p + q) times, extended in one or the other of these ways, in B. 
In either case, So 

dll = (p + q) Y 

b+es 

i.P T Q^OxzBxz- 


Now we look at the observations x and y, noting that the same reasoning will apply to 
the pair y and As before, p XlbeB^ryz dly = Xlbeh dly since each band in Bxyz appears p 
times in B. 

Consider a band b+ G -B of the form [b b], where b G Bxz- We have: 


d 


b+ 

xy 


iry nry 
|rr ury 

2\TinT^\ 

2\T^<JT^\ 

|r,»nT„‘'| 

ir^uT^i 


provided U 7 ^ 0. 


If \T^ UT^I = 0, on the other hand, then neither x nor y is ever in b, and therefore neither 
is ever in b+. In this case, then, d^,t = 0 = d^„. 

Now consider b+ G -B of the form [b~ b] where b G Bxz and b~ G B_xz- We have: 


< = 1 

= 1 
< 1 


nT^'i 
|Tr UTb+l 

\TT + 

\T^-VJT^-\ + \T^VJT^ 


as b and b are appended disjointly 
since this fraction is > 0 . 


But note that d^ = 1 — Because b G -B-x^, we know Ty 7 ^ 0, so that the 

\d- X y \ 

denominator of this fraction is not zero; but the numerator is zero since = 0. Then 
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the fraction is zero, and = 1. Thus . 

' xy d^y d^y 

Every G -B maps to a band in B of one or the other of these types. Therefore, 

Xlb+eB — Xlbes 

Combining these results, we have: 


(p + = (p + q) d^, 

^^Bxyz 

= E 

b+es 

— triangle ineq. on each 

b+eh 

= E < + E € 

b+es b+eh 

£E< + E* 

beh bes 

= p{ Y. 

P{0xyDxy T OyzDyz^ 

E p max(03;jy, Oyz) (^Dxy T 


and so 


But 


pma.yi{Oxy,Oyz) 


- (p + «)o. 


{Dxy + D yz)- 


p max{Oxy, Oy^) max(Oa;j/, Oy^) 


{p + q)Ox 


p + q 

max{Oxy, Oy^) 


as p = Ox 


O 


< 1 , 


since Bxy and By^ are both subsets of Bxyz- Then the above inequality becomes 

Dxz E ^{Dxy + Dyz) 

and so 

Oxz — Oxy T Dyz- 
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